In this project we use R to apply exploratory data analysis (EDA) techniques to a dataset of red wine quality ratings and physicochemical properties. The objective is to find out which chemical properties influence the quality of red wines, and to produce refined plots that illustrate interesting relationships in the data. Background information on the data is available at this link, and a description of the data is here.
We start with the structure of the data. The dataset contains 13 variables and 1599 observations. Descriptive statistics for each variable give a first impression of the data.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
X and quality are integer-valued: X appears to be a row index, and quality is a discrete rating. All other variables seem to be continuous numerical quantities. From the variable names and descriptions, fixed.acidity and volatile.acidity may be correlated with each other, as may free.sulfur.dioxide and total.sulfur.dioxide.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
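The listings above have the shape of what `names()`, `str()`, and `summary()` print for a data frame. As a minimal sketch, the block below runs the same three calls on a toy data frame that holds only the first ten rows of four columns, copied from the `str()` output. The commented-out `read.csv()` path is a hypothetical placeholder, not the file actually used here.

```r
# Toy stand-in for the full dataset: the first ten rows of four columns,
# copied from the str() output above. The real data has 1599 rows.
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
# With the full file one would load it instead (hypothetical path):
# wine <- read.csv("path/to/red_wine.csv")

names(wine)    # column names
str(wine)      # type and first values of each column
summary(wine)  # min, quartiles, mean, and max per column
```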
Since we are primarily interested in quality, its basic statistics deserve a closer look. quality is an ordered, categorical, discrete variable. According to the literature, it was rated on a 0-10 scale by at least 3 wine experts. The observed values range only from 3 to 8, with a mean of 5.6 and a median of 6.
We draw quick histograms of the 12 variables (every column except X) to see the shape of each distribution.
We also draw boxplots of each variable as another view of the distributions.
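One way to produce a histogram and a boxplot per variable is a simple loop over the columns. The sketch below uses base R graphics, which is a swapped-in choice for illustration rather than the plotting code behind the figures in this report, and it runs on a toy data frame with the first ten rows of three columns taken from the `str()` output.

```r
# Toy subset: first ten rows of three columns from the str() output.
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# One histogram and one boxplot per variable, arranged in a grid.
op <- par(mfrow = c(ncol(wine), 2))
box_stats <- list()
for (v in names(wine)) {
  hist(wine[[v]], main = paste("Histogram of", v), xlab = v)
  # boxplot() invisibly returns its statistics; keep the five-number summary.
  box_stats[[v]] <- boxplot(wine[[v]], horizontal = TRUE,
                            main = paste("Boxplot of", v))$stats
}
par(op)  # restore the previous graphics settings
```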
The dataset contains 13 variables and 1599 observations. density and pH appear to be approximately normally distributed, with few outliers. Fixed and volatile acidity, the sulfur dioxides, sulphates, and alcohol seem to be long-tailed.
The main feature of interest in this dataset is quality. It takes discrete values from 3 to 8, and its distribution looks roughly normal. A large majority of the wines examined received ratings of 5 or 6, and very few received 3, 4, or 8.
The variables are physicochemical attributes of red wine, so basic chemistry suggests that the concentration of one chemical may be correlated with that of related chemicals, or of chemicals with similar components or structure. For example, there are three acidity attributes, and since pH is a numeric scale that measures acidity, pH can be regarded as a summary of the wine's acidity.
For further investigation, we plan to create a new ordered variable for quality, which will be more convenient to use in the bivariate and multivariate analysis.
From the boxplots, we can see that all variables have outliers, and that most of the outliers are on the high side. Residual sugar and chlorides have extreme outliers. Citric acid has a large number of zero values. Alcohol has an irregularly shaped distribution but no pronounced outliers.
To see the shape of each distribution in more detail, we can adjust the bin width, choose a suitable scale, or exclude the outliers to get a smoother visualization.
We use the boxplot statistics as the x-axis range, so that some of the outliers are excluded. Finer histograms of each variable are shown below. Because residual.sugar and chlorides are long-tailed, we also draw their histograms on a log scale for a smoother distribution.
## [1] 4.6 7.1 7.9 9.2 12.3
## [1] 0.12 0.39 0.52 0.64 1.01
## [1] 0.00 0.09 0.26 0.42 0.79
## [1] 0.90 1.90 2.20 2.60 3.65
## [1] 0.041 0.070 0.079 0.090 0.119
## [1] 1 7 14 21 42
## [1] 6 22 38 62 122
## [1] 0.33 0.55 0.62 0.73 0.99
## [1] 0.992350 0.995600 0.996750 0.997835 1.001000
## [1] 2.93 3.21 3.31 3.40 3.68
## [1] 8.4 9.5 10.2 11.1 13.5
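Each five-number vector above has the form returned by `boxplot.stats()`: lower whisker, lower hinge, median, upper hinge, upper whisker. A minimal sketch of using the whiskers as the x-axis range, and of the log-scale alternative, is shown below on the first ten residual.sugar values from the `str()` output. Base R graphics and the object names are assumptions made for illustration.

```r
# First ten residual.sugar values from the str() output (toy subset).
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# $stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker.
five <- boxplot.stats(sugar)$stats
x_range <- range(five)  # whisker-to-whisker range

# Keep only the points inside the whiskers, then plot on that range.
inside <- sugar[sugar >= x_range[1] & sugar <= x_range[2]]
hist(inside, xlim = x_range, main = "residual.sugar within whiskers",
     xlab = "residual.sugar")

# Alternative for long-tailed variables: keep every point, transform the scale.
hist(log10(sugar), main = "residual.sugar (log10 scale)",
     xlab = "log10(residual.sugar)")
```

On this toy subset the single large value (6.1) falls outside the whiskers and is dropped from the first histogram, while the log-scale histogram keeps it.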
To investigate the relationship between two variables, we start by calculating the correlation between every pair of variables in the dataset, then pick the pairs with the stronger correlations for further analysis.
Below is another correlation plot. Each filled circle shows the strength of the correlation between two variables: a bigger, darker circle indicates a stronger correlation, while a smaller, lighter circle indicates a weaker one.
From the correlation charts, the pairs with the stronger correlations include:
- fixed.acidity vs. citric.acid
- fixed.acidity vs. pH
- fixed.acidity vs. density
- volatile.acidity vs. citric.acid
- free.sulfur.dioxide vs. total.sulfur.dioxide
- chlorides vs. sulphates
- alcohol vs. density
- quality vs. alcohol
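Picking out the stronger pairs can also be done numerically by ranking the entries of the correlation matrix by absolute value. The following is a sketch in base R on a toy data frame (first ten rows of four columns from the `str()` output), so the coefficients it prints will not match those of the full dataset. The names `cm` and `ranked` are illustrative.

```r
# Toy subset: first ten rows of four columns from the str() output.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

cm <- cor(wine)  # Pearson correlation matrix (the default method)

# Each unordered pair appears once in the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[upper.tri(cm)]
)

# Strongest correlations first, regardless of sign.
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked, row.names = FALSE)
```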
We create a new ordered quality variable for later analysis. The original quality variable is an integer, while the new one is an ordered factor, so the dataset is divided into six groups labelled by quality.
## Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
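An ordered factor like the one printed above can be built with `factor(..., ordered = TRUE)`. The sketch below uses the first ten quality values from the `str()` output. Passing `levels = 3:8` explicitly is an assumption added so that the toy subset still gets all six levels; on the full data every level from 3 to 8 occurs, so the default levels would give the same result.

```r
# First ten quality ratings from the str() output (toy subset).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor: the levels keep their ranking 3 < 4 < ... < 8.
quality_factor <- factor(quality, levels = 3:8, ordered = TRUE)

str(quality_factor)    # shows an Ord.factor with 6 levels
table(quality_factor)  # number of wines per quality level
```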
In this plot, we can see the correlation between volatile.acidity and alcohol; the correlation coefficient is -0.202288.
##
## Pearson's product-moment correlation
##
## data: wine$volatile.acidity and wine$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
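The output above is what `cor.test()` prints for a Pearson test. A minimal sketch of the call, run on the first ten values of each variable from the `str()` output (so the numbers differ from the full-data result above):

```r
# First ten values of each variable from the str() output (toy subset).
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Pearson's product-moment correlation with a two-sided test of r = 0.
res <- cor.test(volatile.acidity, alcohol, method = "pearson")

res$estimate   # sample correlation coefficient
res$conf.int   # 95 percent confidence interval
res$p.value    # p-value of the test
```

The individual components are handy when a coefficient needs to be quoted in the text rather than read off the printed block.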
Another plot shows the relationship between density and alcohol; the purple line traces the median density of wines with the same alcohol level. Sample statistics are shown below.
## # A tibble: 6 × 4
## alcohol density_mean density_median n
## <dbl> <dbl> <dbl> <int>
## 1 8.40 1.0001000 1.00010 2
## 2 8.50 0.9991400 0.99914 1
## 3 8.70 0.9977500 0.99775 2
## 4 8.80 1.0024200 1.00242 2
## 5 9.00 0.9984173 0.99780 30
## 6 9.05 0.9958500 0.99585 1
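The per-alcohol summary above can be reproduced with a group-and-summarise step. The sketch below does this in base R with `split()` and `sapply()`, which is a swapped-in approach for illustration (the tibble output suggests a different toolchain was used). It uses only the five density values visible in the `str()` output, which are rounded there, so the data is a toy subset.

```r
# First five rows from the str() output; density is shown rounded there.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  density = c(0.998, 0.997, 0.997, 0.998, 0.998)
)

# Group density by alcohol level, then summarise each group.
groups <- split(wine$density, wine$alcohol)
by_alcohol <- data.frame(
  alcohol        = as.numeric(names(groups)),
  density_mean   = sapply(groups, mean),
  density_median = sapply(groups, median),
  n              = sapply(groups, length),
  row.names      = NULL
)
print(by_alcohol)

# Scatter plot with the per-level median overlaid as a line.
plot(wine$alcohol, wine$density, xlab = "alcohol", ylab = "density")
lines(by_alcohol$alcohol, by_alcohol$density_median, col = "purple")
```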
We also facet the data points by quality to compare the correlation at different quality levels.
## wine$quality_factor: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## wine$quality_factor: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality_factor: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality_factor: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality_factor: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality_factor: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## wine$quality_factor: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality_factor: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality_factor: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality_factor: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality_factor: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality_factor: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
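The two blocks of per-quality summaries above have the format printed by `by(x, group, summary)`. A minimal sketch on the first ten rows from the `str()` output follows; the toy subset only contains quality levels 5, 6, and 7, so only those groups appear.

```r
# Toy subset: first ten rows of three columns from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
wine$quality_factor <- factor(wine$quality, ordered = TRUE)

# Six-number summary of each variable within each quality level.
by(wine$alcohol, wine$quality_factor, summary)
by(wine$volatile.acidity, wine$quality_factor, summary)

# The medians alone, as a named vector that is easy to compare across levels.
alcohol_median <- tapply(wine$alcohol, wine$quality_factor, median)
print(alcohol_median)

# The same comparison as a boxplot of alcohol by quality level.
boxplot(alcohol ~ quality_factor, data = wine,
        xlab = "quality", ylab = "alcohol")
```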
Wine quality is correlated with alcohol level and volatile acidity. As volatile acidity decreases, wine quality increases. Wine quality also increases with alcohol level, but this trend is less clear for wines of quality 3 and 4.
There are strong correlations among the acidity variables, between pH and acidity, and between free.sulfur.dioxide and total.sulfur.dioxide.
The strongest positive correlation is between fixed.acidity and citric.acid, with a correlation coefficient of 0.6717.
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
As we saw in the bivariate section, quality is correlated with both volatile.acidity and alcohol. In this section, we draw a scatter plot of volatile.acidity against alcohol and color the points by quality factor. The plot shows the relationship between volatile.acidity and alcohol; the correlation coefficient is -0.202288, which is not a strong correlation.
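A scatter plot colored by quality level can be sketched as follows. This uses base R graphics and an arbitrary two-color ramp, both assumptions for illustration rather than the settings behind the figure described here, and it runs on the first ten rows from the `str()` output.

```r
# Toy subset: first ten rows of three columns from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
wine$quality_factor <- factor(wine$quality, ordered = TRUE)

# One color per quality level, running from low to high quality.
pal <- colorRampPalette(c("steelblue", "firebrick"))(nlevels(wine$quality_factor))
point_col <- pal[as.integer(wine$quality_factor)]

plot(wine$alcohol, wine$volatile.acidity, col = point_col, pch = 19,
     xlab = "alcohol", ylab = "volatile.acidity")
legend("topright", legend = levels(wine$quality_factor),
       col = pal, pch = 19, title = "quality")
```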
Because alcohol content is an important factor for wine quality, we choose its histogram as one of the three plots. The distribution of alcohol level is right-skewed (the mean, 10.42, lies above the median, 10.20, and the tail extends to 14.9), and the most frequent alcohol level is 9.5.
This boxplot shows the relationship between alcohol content and wine quality. Generally, higher alcohol content is associated with higher wine quality. However, the median alcohol levels of the lower-quality wines (3 and 4) are almost the same.
As the correlation tests show, wine quality is most strongly associated with alcohol and volatile acidity. We can conclude that better wines tend to have relatively higher alcohol content and lower volatile acidity.
Through this exploratory data analysis, we can reach the following conclusions:
- The most frequent quality levels of red wine are 5 and 6.
- When alcohol percentage decreases, density increases.
- When fixed acidity increases, density increases as well.
- The acidity variables are strongly correlated with each other.
From this investigation, I conclude that the key factors associated with wine quality are alcohol content and volatile acidity.
For future exploration of this data, I would group the wines into quality buckets (for example, levels 3-4, 5-6, and 7-8) and look at the patterns that appear in each of these three buckets.
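The three buckets mentioned above could be created with `cut()`. This is only a sketch of the idea on the first ten quality values from the `str()` output; the bucket labels and the name `quality_bucket` are illustrative assumptions.

```r
# First ten quality ratings from the str() output (toy subset).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are right-closed by default: (2,4], (4,6], (6,8].
quality_bucket <- cut(quality,
                      breaks = c(2, 4, 6, 8),
                      labels = c("3-4", "5-6", "7-8"),
                      ordered_result = TRUE)

table(quality_bucket)  # number of wines in each bucket
```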